Day 5: Deep Dive into Attention Mechanisms
The heart of the Transformer is attention. Today we'll work through it end to end, from the meaning of Query/Key/Value through Multi-Head Attention and positional encoding, implementing each piece from scratch with numpy.
Intuitive Understanding of Query, Key, Value
Using a library analogy:
- Query (Q): What I’m looking for (the search query)
- Key (K): The title/tags of each book (the index)
- Value (V): The actual content of the book (the content)
We compute the similarity between Q and each K, then retrieve a weighted sum of the Values in proportion to those similarities.
Scaled Dot-Product Attention Implementation
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k) - Query
    K: (seq_len, d_k) - Key
    V: (seq_len, d_v) - Value
    mask: used in the decoder to hide future tokens
    """
    d_k = K.shape[-1]
    # Dot product then scale (large d_k leads to large dot products, making softmax extreme)
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Masking: prevent the decoder from seeing future tokens
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    # Numerically stable softmax: a probability distribution per row
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, V), weights

# 4 tokens, 8 dimensions
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

# Causal mask (GPT-style: only attend to previous tokens)
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(f"Causal mask:\n{causal_mask.astype(int)}")

output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Attention weights:\n{weights.round(3)}")
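Two sanity checks follow directly from the code above: each row of the attention weights should sum to 1, and with the causal mask every weight above the diagonal should be (numerically) zero. A standalone sketch, re-declaring the function so the snippet runs on its own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, V), weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
mask = np.tril(np.ones((seq_len, seq_len)))

_, w = scaled_dot_product_attention(Q, K, V, mask=mask)
print(np.allclose(w.sum(axis=-1), 1.0))   # True: each row is a probability distribution
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no attention to future tokens
```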
Multi-Head Attention Implementation
def multi_head_attention(x, num_heads, d_model):
    """
    Run multiple attention heads in parallel.
    Each head learns different relationship patterns.
    """
    d_k = d_model // num_heads
    outputs = []
    for head in range(num_heads):
        # Separate Q, K, V projection weights per head
        # (random here for illustration; in a real model these are learned)
        W_q = np.random.randn(d_model, d_k) * 0.1
        W_k = np.random.randn(d_model, d_k) * 0.1
        W_v = np.random.randn(d_model, d_k) * 0.1

        Q = np.matmul(x, W_q)
        K = np.matmul(x, W_k)
        V = np.matmul(x, W_v)

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        outputs.append(head_output)

    # Concatenate outputs from all heads: (seq_len, num_heads * d_k) = (seq_len, d_model)
    concatenated = np.concatenate(outputs, axis=-1)

    # Final linear projection
    W_o = np.random.randn(d_model, d_model) * 0.1
    return np.matmul(concatenated, W_o)

d_model = 64
num_heads = 8  # 64 / 8 = 8 dimensions per head
x = np.random.randn(4, d_model)

output = multi_head_attention(x, num_heads, d_model)
print(f"Multi-Head Attention output: {output.shape}")
# e.g. Head 1 might track subject-verb relationships, Head 2 adjective-noun relationships, ...
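The per-head Python loop above is easy to read but slow. Real implementations compute all heads at once with one large projection and a reshape; here is a minimal vectorized sketch (weights are random for the demo, where a trained model would learn them):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 64, 8
d_k = d_model // num_heads
x = rng.standard_normal((seq_len, d_model))

# One big projection per Q/K/V instead of num_heads small ones
W_q = rng.standard_normal((d_model, d_model)) * 0.1
W_k = rng.standard_normal((d_model, d_model)) * 0.1
W_v = rng.standard_normal((d_model, d_model)) * 0.1

def split_heads(t):
    # (seq_len, d_model) -> (num_heads, seq_len, d_k)
    return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

Q = split_heads(x @ W_q)
K = split_heads(x @ W_k)
V = split_heads(x @ W_v)

# Batched attention over all heads at once
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V                                       # (heads, seq, d_k)

# Merge heads back into (seq_len, d_model)
merged = out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (4, 64)
```

The result is equivalent to the loop version (given the same weights), but a single batched matmul lets numpy, or a GPU library, do all heads in parallel.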
Positional Encoding: Sinusoidal vs RoPE
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Original Transformer positional encoding"""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions: sin
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions: cos
    return pe

def apply_rope(x, position):
    """RoPE (Rotary Position Embedding) - used in Llama, GPT-NeoX, etc."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs
    # Rotate each even/odd pair of dimensions by a position-dependent angle
    cos_vals = np.cos(angles)
    sin_vals = np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated_even = x_even * cos_vals - x_odd * sin_vals
    rotated_odd = x_even * sin_vals + x_odd * cos_vals
    result = np.zeros_like(x)
    result[..., 0::2] = rotated_even
    result[..., 1::2] = rotated_odd
    return result

# Verify sinusoidal positional encoding
pe = sinusoidal_position_encoding(max_len=10, d_model=16)
print(f"Positional encoding shape: {pe.shape}")
print(f"Distance between position 0 and 1: {np.linalg.norm(pe[0] - pe[1]):.3f}")
print(f"Distance between position 0 and 9: {np.linalg.norm(pe[0] - pe[9]):.3f}")
# Closer positions have more similar vectors
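The property that makes RoPE attractive can be checked numerically: the dot product between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions. A standalone sketch using apply_rope as defined above (repeated so the snippet runs on its own):

```python
import numpy as np

def apply_rope(x, position):
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs
    cos_vals, sin_vals = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    result = np.zeros_like(x)
    result[..., 0::2] = x_even * cos_vals - x_odd * sin_vals
    result[..., 1::2] = x_even * sin_vals + x_odd * cos_vals
    return result

rng = np.random.default_rng(2)
q = rng.standard_normal(16)
k = rng.standard_normal(16)

# Same offset (m - n = 2) at different absolute positions
d1 = apply_rope(q, 3) @ apply_rope(k, 1)
d2 = apply_rope(q, 10) @ apply_rope(k, 8)
print(d1, d2)  # the two values are (numerically) identical
```

This is why RoPE encodes *relative* position inside the attention scores themselves, whereas sinusoidal encodings are simply added to the token embeddings before attention.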
Attention learns “which token should attend to which other token.” Multi-Head performs this simultaneously from multiple perspectives, creating richer representations.
Today’s Exercises
- Explain mathematically why we scale by np.sqrt(d_k). Experiment with d_k=64 and compare the softmax output with and without scaling.
- Change the number of heads in Multi-Head Attention to 1, 4, 8, and 16, and observe how d_k changes. What problems arise when there are too many heads?
- Summarize the key differences between sinusoidal positional encoding and RoPE, and research why modern models prefer RoPE.
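As a starting point for the first exercise, here is a small sketch comparing the softmax of raw versus scaled dot products at d_k=64. Unscaled scores have standard deviation around sqrt(d_k), so the softmax concentrates almost all mass on a single key:

```python
import numpy as np

rng = np.random.default_rng(3)
d_k = 64
q = rng.standard_normal(d_k)
K = rng.standard_normal((8, d_k))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

raw = softmax(q @ K.T)                    # no scaling: nearly one-hot
scaled = softmax(q @ K.T / np.sqrt(d_k))  # scaled: softer distribution
print(f"max weight without scaling: {raw.max():.3f}")
print(f"max weight with scaling:    {scaled.max():.3f}")
```

Multiplying the logits by a constant greater than 1 always sharpens the softmax toward its argmax, so the unscaled maximum weight is strictly larger; the scaling keeps gradients flowing to the non-maximal keys during training.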